Abstract

Stochastic Dual Coordinate Ascent (SDCA) is a popular method for solving regularized loss minimization
problems with convex losses. In this paper we show how a variant of SDCA can be applied to non-convex
losses. We prove a linear convergence rate even if individual loss functions are non-convex, as long as the
expected loss is convex.


1 Introduction 


The following regularized loss minimization problem is associated with many machine learning methods: 


\[
\min_{w \in \mathbb{R}^d} P(w) := \frac{1}{n} \sum_{i=1}^{n} \phi_i(w) + \frac{\lambda}{2} \|w\|^2 .
\]


One of the most popular methods for solving this problem is Stochastic Dual Coordinate Ascent (SDCA).
[8] analyzed this method, and showed that when each $\phi_i$ is $L$-smooth and convex then the convergence rate
of SDCA is $\tilde{O}((L/\lambda + n) \log(1/\epsilon))$.

As its name indicates, SDCA is derived by considering a dual problem. In this paper, we consider
the possibility of applying SDCA to problems in which individual $\phi_i$ are non-convex, e.g., deep learning
optimization problems. In many such cases, the dual problem is meaningless. Instead of directly using the
dual problem, we describe and analyze a variant of SDCA in which only gradients of $\phi_i$ are used
(similar to option 5 in the pseudo code of Prox-SDCA given in [6]). Following [3], we show that SDCA is a
variant of Stochastic Gradient Descent (SGD), that is, its update is based on an unbiased estimate of the
gradient. But, unlike vanilla SGD, for SDCA the variance of the gradient estimate tends to zero
as we converge to a minimum.

For the case in which each $\phi_i$ is $L$-smooth and convex, we derive the same linear convergence rate
of $\tilde{O}((L/\lambda + n) \log(1/\epsilon))$ as in [8], but with a simpler, direct, dual-free proof. We also provide a linear
convergence rate for the case in which individual $\phi_i$ can be non-convex, as long as the average of the $\phi_i$ is
convex. The rate for non-convex losses has a worse dependence on $L/\lambda$, and we leave it open whether a
better rate can be obtained for the non-convex case.


Related work: In recent years, many methods for optimizing regularized loss minimization problems
have been proposed. For example, SAG [5], SVRG [3], Finito [2], SAGA [1], and S2GD [4]. The best
convergence rate is for accelerated SDCA [6]. A systematic study of the convergence rate of the different
methods under non-convex losses is left to future work.
2 SDCA without Duality


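We maintain pseudo-dual vectors $\alpha_1, \ldots, \alpha_n$, where each $\alpha_i \in \mathbb{R}^d$.

Dual-Free SDCA($P, T, \eta, \alpha^{(0)}$)

Goal: Minimize $P(w) = \frac{1}{n} \sum_{i=1}^{n} \phi_i(w) + \frac{\lambda}{2} \|w\|^2$
Input: Objective $P$, number of iterations $T$, step size $\eta$ s.t. $\beta := \eta \lambda n < 1$,
       initial dual vectors $\alpha^{(0)} = (\alpha_1^{(0)}, \ldots, \alpha_n^{(0)})$

Initialize: $w^{(0)} = \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i^{(0)}$

For $t = 1, \ldots, T$

    Pick $i$ uniformly at random from $[n]$

    Update: $\alpha_i^{(t)} = \alpha_i^{(t-1)} - \eta \lambda n \left( \nabla \phi_i(w^{(t-1)}) + \alpha_i^{(t-1)} \right)$

    Update: $w^{(t)} = w^{(t-1)} - \eta \left( \nabla \phi_i(w^{(t-1)}) + \alpha_i^{(t-1)} \right)$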
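The pseudo code above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: the function name `dual_free_sdca`, the toy data `X`, `Y`, and the choice of squared losses $\phi_i(w) = \frac{1}{2}(x_i^\top w - y_i)^2$ are assumptions made here for the example, and the step size $\eta = 1/(L + \lambda n)$ is the one used in the convex analysis.

```python
import random

def dual_free_sdca(grad_phi, n, d, lam, eta, T, alpha0=None, seed=0):
    """Run T iterations of the dual-free SDCA updates.

    grad_phi(i, w) returns the gradient of phi_i at w as a list of length d.
    Returns the final primal vector w and the pseudo-dual vectors alpha.
    """
    beta = eta * lam * n
    assert 0 < beta < 1, "step size must satisfy eta * lam * n < 1"
    rng = random.Random(seed)
    alpha = [list(a) for a in alpha0] if alpha0 else [[0.0] * d for _ in range(n)]
    # Initialize w so that w = (1 / (lam * n)) * sum_i alpha_i.
    w = [sum(a[k] for a in alpha) / (lam * n) for k in range(d)]
    for _ in range(T):
        i = rng.randrange(n)                        # pick i uniformly from [n]
        g = grad_phi(i, w)
        v = [g[k] + alpha[i][k] for k in range(d)]  # grad phi_i(w) + alpha_i
        alpha[i] = [alpha[i][k] - beta * v[k] for k in range(d)]
        w = [w[k] - eta * v[k] for k in range(d)]
    return w, alpha

# Hypothetical toy problem: squared losses phi_i(w) = 0.5 * (x_i . w - y_i)^2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
Y = [1.0, 2.0, 3.0, -1.0]
lam = 0.1
n, d = len(X), len(X[0])
L = max(sum(xk * xk for xk in x) for x in X)        # each phi_i is L-smooth

def grad_phi(i, w):
    r = sum(X[i][k] * w[k] for k in range(d)) - Y[i]
    return [r * X[i][k] for k in range(d)]

def grad_P(w):
    g = [lam * wk for wk in w]
    for i in range(n):
        gi = grad_phi(i, w)
        g = [g[k] + gi[k] / n for k in range(d)]
    return g

w, alpha = dual_free_sdca(grad_phi, n, d, lam, eta=1.0 / (L + lam * n), T=5000)
```

After the run, `grad_P(w)` should be close to the zero vector, and `w` should agree with the scaled sum of the entries of `alpha`, since both updates subtract the same multiple of the same vector.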

Observe that SDCA keeps the primal-dual relation

\[
w^{(t-1)} = \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i^{(t-1)} .
\]


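Observe also that the update of $\alpha$ can be rewritten as

\[
\alpha_i^{(t)} = (1 - \beta)\, \alpha_i^{(t-1)} + \beta \left( -\nabla \phi_i(w^{(t-1)}) \right) ,
\]

namely, the new value of $\alpha_i$ is a convex combination of its old value and the negation of the gradient. Finally,
observe that, conditioned on the value of $w^{(t-1)}$ and $\alpha^{(t-1)}$, we have that

\begin{align*}
\mathbb{E}[w^{(t)}] &= w^{(t-1)} - \eta \left( \nabla \mathbb{E}\, \phi_i(w^{(t-1)}) + \mathbb{E}\, \alpha_i^{(t-1)} \right) \\
&= w^{(t-1)} - \eta \left( \nabla \frac{1}{n} \sum_{i=1}^{n} \phi_i(w^{(t-1)}) + \lambda w^{(t-1)} \right) \\
&= w^{(t-1)} - \eta \nabla P(w^{(t-1)}) .
\end{align*}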

That is, SDCA is in fact an instance of Stochastic Gradient Descent. As we will see in the analysis section
below, the advantage of SDCA over a vanilla SGD algorithm is that the variance of the update goes to
zero as we converge to an optimum.
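The unbiasedness claim can be checked numerically. The sketch below uses a hypothetical one-dimensional problem with squared losses (the data `xs`, `ys` and the pseudo-dual values in `alpha` are arbitrary choices made here), sets $w$ through the primal-dual relation, and averages $\nabla \phi_i(w) + \alpha_i$ over all $i$ exactly.

```python
# Hypothetical 1-D toy data with phi_i(w) = 0.5 * (x_i * w - y_i)^2.
xs = [1.0, 2.0, -1.0, 0.5]
ys = [0.5, 1.0, 2.0, 1.0]
lam = 0.3
n = len(xs)

def grad_phi(i, w):
    return (xs[i] * w - ys[i]) * xs[i]

def grad_P(w):
    return sum(grad_phi(i, w) for i in range(n)) / n + lam * w

# Arbitrary pseudo-dual values; w is then fixed by the primal-dual relation.
alpha = [0.2, -0.7, 1.1, 0.4]
w = sum(alpha) / (lam * n)

# Exact expectation, over a uniformly random index i, of grad phi_i(w) + alpha_i.
expected_v = sum(grad_phi(i, w) + alpha[i] for i in range(n)) / n
```

Here `expected_v` and `grad_P(w)` agree up to floating-point rounding, because the average of the $\alpha_i$ equals $\lambda w$ under the primal-dual relation.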


3 Analysis 


The theorem below provides a linear convergence rate for smooth and convex functions. The rate matches 
the analysis given in [8], but the analysis is simpler and does not rely on duality. 

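Theorem 1. Assume that each $\phi_i$ is $L$-smooth and convex, and the algorithm is run with $\eta \le \frac{1}{L + \lambda n}$. Let $w^*$
be the minimizer of $P(w)$ and let $\alpha_i^* = -\nabla \phi_i(w^*)$. Then, for every $t \ge 1$,

\[
\mathbb{E}\left[ \frac{\lambda}{2} \|w^{(t)} - w^*\|^2 + \frac{1}{2Ln} \sum_{i=1}^{n} \|\alpha_i^{(t)} - \alpha_i^*\|^2 \right]
\le e^{-\eta \lambda t} \left[ \frac{\lambda}{2} \|w^{(0)} - w^*\|^2 + \frac{1}{2Ln} \sum_{i=1}^{n} \|\alpha_i^{(0)} - \alpha_i^*\|^2 \right] .
\]

In particular, setting $\eta = \frac{1}{L + \lambda n}$, then after

\[
T \ge \tilde{\Omega}\left( \left( \frac{L}{\lambda} + n \right) \log(1/\epsilon) \right)
\]

iterations we will have $\mathbb{E}[P(w^{(T)}) - P(w^*)] \le \epsilon$.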
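To make the iteration count concrete, the following sketch computes how many iterations the contraction factor $e^{-\eta \lambda t}$ needs to fall below a target value. It assumes the step size $\eta = 1/(L + \lambda n)$, which is what the condition $\eta \le (1 - \beta)/L$ used in the proof gives with $\beta = \eta \lambda n$, so that $1/(\eta \lambda) = L/\lambda + n$. The helper name `sdca_iterations` and the constants are hypothetical.

```python
import math

def sdca_iterations(L, lam, n, shrink):
    """Smallest t with exp(-eta * lam * t) <= shrink, for eta = 1 / (L + lam * n)."""
    eta = 1.0 / (L + lam * n)
    return math.ceil(math.log(1.0 / shrink) / (eta * lam))

# Hypothetical constants, chosen only for illustration.
t_needed = sdca_iterations(L=1.0, lam=1e-3, n=10_000, shrink=1e-6)
```

With these constants $L/\lambda + n = 11000$, so `t_needed` is about $11000 \cdot \ln(10^6)$, that is, roughly 15 passes over the data.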

The theorem below provides a linear convergence rate for smooth functions, without assuming that
individual $\phi_i$ are convex. We only require that the average of the $\phi_i$ is convex. The dependence on $L/\lambda$ is worse
in this case.


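Theorem 2. Assume that each $\phi_i$ is $L$-smooth and that the average function, $\frac{1}{n} \sum_{i=1}^{n} \phi_i$, is convex. Let $w^*$
be the minimizer of $P(w)$ and let $\alpha_i^* = -\nabla \phi_i(w^*)$. Then, if we run SDCA with $\eta = \min\left\{ \frac{\lambda}{2L^2}, \frac{1}{2\lambda n} \right\}$, we
have that

\[
\mathbb{E}\left[ \frac{\lambda}{2} \|w^{(t)} - w^*\|^2 + \frac{\lambda}{2L^2 n} \sum_{i=1}^{n} \|\alpha_i^{(t)} - \alpha_i^*\|^2 \right]
\le e^{-\eta \lambda t} \left[ \frac{\lambda}{2} \|w^{(0)} - w^*\|^2 + \frac{\lambda}{2L^2 n} \sum_{i=1}^{n} \|\alpha_i^{(0)} - \alpha_i^*\|^2 \right] .
\]

It follows that whenever

\[
T \ge \tilde{\Omega}\left( \left( \frac{L^2}{\lambda^2} + n \right) \log(1/\epsilon) \right)
\]

we have that $\mathbb{E}[P(w^{(T)}) - P(w^*)] \le \epsilon$.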
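As an illustration of the non-convex setting, the sketch below runs the dual-free updates on a hypothetical one-dimensional problem with $\phi_i(w) = \frac{1}{2} a_i (w - b_i)^2$, where some $a_i$ are negative (so those $\phi_i$ are non-convex) but the average curvature is non-negative. The data `a`, `b` and the step size $\eta = \min\{\lambda/(2L^2), 1/(2\lambda n)\}$ are assumptions of this example; the step size satisfies the condition $\eta \le \lambda(1 - \beta)/L^2$ used in the proof.

```python
import random

# Hypothetical 1-D losses phi_i(w) = 0.5 * a_i * (w - b_i)^2.
# A negative a_i makes the corresponding phi_i non-convex.
a = [3.0, -1.0, 2.0, -0.5]
b = [1.0, 2.0, -1.0, 0.5]
lam = 0.5
n = len(a)
L = max(abs(ai) for ai in a)      # each phi_i is L-smooth
assert sum(a) / n >= 0            # the average loss is convex

eta = min(lam / (2 * L ** 2), 1 / (2 * lam * n))   # assumed step size
beta = eta * lam * n

rng = random.Random(0)
alpha = [0.0] * n
w = sum(alpha) / (lam * n)
for _ in range(6000):
    i = rng.randrange(n)
    v = a[i] * (w - b[i]) + alpha[i]   # grad phi_i(w) + alpha_i
    alpha[i] -= beta * v
    w -= eta * v

# Closed-form minimizer of P for this quadratic toy problem.
w_star = (sum(ai * bi for ai, bi in zip(a, b)) / n) / (sum(a) / n + lam)
```

On this toy problem `w` ends up close to `w_star`, and each `alpha[i]` ends up close to $-\nabla \phi_i(w^*) = -a_i (w^* - b_i)$.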


3.1 SDCA as variance-reduced SGD 

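As we have shown before, SDCA is an instance of SGD, in the sense that the update can be written as
$w^{(t)} = w^{(t-1)} - \eta v_t$, with $v_t = \nabla \phi_i(w^{(t-1)}) + \alpha_i^{(t-1)}$ satisfying $\mathbb{E}[v_t] = \nabla P(w^{(t-1)})$.

The advantage of SDCA over a generic SGD is that the variance of the update goes to zero as we
converge to the optimum. To see this, observe that

\begin{align*}
\mathbb{E}[\|v_t\|^2] &= \mathbb{E}[\|\alpha_i^{(t-1)} + \nabla \phi_i(w^{(t-1)})\|^2] = \mathbb{E}[\|\alpha_i^{(t-1)} - \alpha_i^* + \alpha_i^* + \nabla \phi_i(w^{(t-1)})\|^2] \\
&\le 2\, \mathbb{E}[\|\alpha_i^{(t-1)} - \alpha_i^*\|^2] + 2\, \mathbb{E}[\|-\nabla \phi_i(w^{(t-1)}) - \alpha_i^*\|^2] .
\end{align*}

Theorem 1 (or Theorem 2) tells us that the term $\mathbb{E}[\|\alpha_i^{(t-1)} - \alpha_i^*\|^2]$ goes to zero as $e^{-\eta \lambda t}$. For the second
term, by smoothness of $\phi_i$ we have $\|-\nabla \phi_i(w^{(t-1)}) - \alpha_i^*\| = \|\nabla \phi_i(w^{(t-1)}) - \nabla \phi_i(w^*)\| \le L \|w^{(t-1)} - w^*\|$,
and therefore, using Theorem 1 (or Theorem 2) again, the second term also goes to zero as $e^{-\eta \lambda t}$. All in all,
when $t \ge \tilde{\Omega}\left( \left( \frac{L}{\lambda} + n \right) \log(1/\epsilon) \right)$ (with $L^2/\lambda^2$ in place of $L/\lambda$ under the assumptions of Theorem 2) we will have that $\mathbb{E}[\|v_t\|^2] \le \epsilon$.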
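The contrast with vanilla SGD can be seen on a small example. The sketch below, on a hypothetical one-dimensional problem with squared losses, computes the exact second moment of the SDCA direction $\nabla \phi_i(w) + \alpha_i$ before and after running the updates, and compares it with the second moment of the vanilla SGD direction $\nabla \phi_i(w) + \lambda w$ at the final iterate. The data, iteration count, and function names are illustrative assumptions.

```python
import random

# Hypothetical 1-D toy data with phi_i(w) = 0.5 * (x_i * w - y_i)^2.
xs = [1.0, 2.0, -1.0, 0.5]
ys = [0.5, 1.0, 2.0, 1.0]
lam = 0.3
n = len(xs)
L = max(x * x for x in xs)

def grad_phi(i, w):
    return (xs[i] * w - ys[i]) * xs[i]

def sdca_second_moment(w, alpha):
    # E_i |grad phi_i(w) + alpha_i|^2, averaged exactly over i.
    return sum((grad_phi(i, w) + alpha[i]) ** 2 for i in range(n)) / n

def sgd_second_moment(w):
    # E_i |grad phi_i(w) + lam * w|^2 for the vanilla SGD direction.
    return sum((grad_phi(i, w) + lam * w) ** 2 for i in range(n)) / n

eta = 1.0 / (L + lam * n)
beta = eta * lam * n
rng = random.Random(0)
alpha = [0.0] * n
w = 0.0
before = sdca_second_moment(w, alpha)
for _ in range(3000):
    i = rng.randrange(n)
    v = grad_phi(i, w) + alpha[i]
    alpha[i] -= beta * v
    w -= eta * v
after = sdca_second_moment(w, alpha)
sgd_at_end = sgd_second_moment(w)
```

On this example `after` is negligible compared with `before`, while `sgd_at_end` stays bounded away from zero, since the individual gradients $\nabla \phi_i(w^*)$ do not vanish at the optimum.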

4 Proofs 

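Observe that $0 = \nabla P(w^*) = \frac{1}{n} \sum_{i=1}^{n} \nabla \phi_i(w^*) + \lambda w^*$, which implies that $w^* = \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i^*$.

Define $u_i = -\nabla \phi_i(w^{(t-1)})$ and $v_t = -u_i + \alpha_i^{(t-1)}$. We also denote two potentials:

\[
A_t = \frac{1}{n} \sum_{j=1}^{n} \|\alpha_j^{(t)} - \alpha_j^*\|^2 , \qquad B_t = \|w^{(t)} - w^*\|^2 .
\]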
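We will first analyze the evolution of $A_t$ and $B_t$. If on round $t$ we update using element $i$ then $\alpha_i^{(t)} =
(1 - \beta) \alpha_i^{(t-1)} + \beta u_i$, where $\beta = \eta \lambda n$. It follows that

\begin{align*}
A_t - A_{t-1} &= \frac{1}{n} \|\alpha_i^{(t)} - \alpha_i^*\|^2 - \frac{1}{n} \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 \\
&= \frac{1}{n} \|(1 - \beta)(\alpha_i^{(t-1)} - \alpha_i^*) + \beta (u_i - \alpha_i^*)\|^2 - \frac{1}{n} \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 \\
&= \frac{1}{n} \left( (1 - \beta) \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 + \beta \|u_i - \alpha_i^*\|^2 - \beta (1 - \beta) \|\alpha_i^{(t-1)} - u_i\|^2 - \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 \right) \\
&= \frac{\beta}{n} \left( -\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 + \|u_i - \alpha_i^*\|^2 - (1 - \beta) \|v_t\|^2 \right) \\
&= \eta \lambda \left( -\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 + \|u_i - \alpha_i^*\|^2 - (1 - \beta) \|v_t\|^2 \right) . \tag{1}
\end{align*}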


In addition,

\[
B_t - B_{t-1} = \|w^{(t)} - w^*\|^2 - \|w^{(t-1)} - w^*\|^2 = -2 \eta\, (w^{(t-1)} - w^*)^\top v_t + \eta^2 \|v_t\|^2 . \tag{2}
\]
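The expansion of the squared norm of a convex combination used in the evolution of $A_t$ is the identity $\|(1 - \beta) a + \beta b\|^2 = (1 - \beta) \|a\|^2 + \beta \|b\|^2 - \beta (1 - \beta) \|a - b\|^2$. A quick numeric check, with arbitrary vectors chosen here for illustration:

```python
def sq_norm(u):
    return sum(x * x for x in u)

beta = 0.3
a = [1.0, -2.0, 0.5]     # stands for alpha_i^(t-1) - alpha_i^*
b = [0.25, 4.0, -1.5]    # stands for u_i - alpha_i^*

mix = [(1 - beta) * ak + beta * bk for ak, bk in zip(a, b)]
diff = [ak - bk for ak, bk in zip(a, b)]

lhs = sq_norm(mix)
rhs = (1 - beta) * sq_norm(a) + beta * sq_norm(b) - beta * (1 - beta) * sq_norm(diff)
```

The two sides agree up to rounding. Note that $a - b = \alpha_i^{(t-1)} - u_i = v_t$, which is how $\|v_t\|^2$ enters the bound.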

The proofs of Theorem 1 and Theorem 2 will follow by studying different combinations of $A_t$ and $B_t$.


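4.1 Proof of Theorem 2

Define

\[
C_t = \frac{\lambda}{2L^2} A_t + \frac{\lambda}{2} B_t .
\]

Combining (1) and (2) we obtain

\begin{align*}
C_{t-1} - C_t &= \eta \lambda \left[ \frac{\lambda}{2L^2} \left( \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \|u_i - \alpha_i^*\|^2 + (1 - \beta) \|v_t\|^2 \right) + (w^{(t-1)} - w^*)^\top v_t - \frac{\eta}{2} \|v_t\|^2 \right] \\
&= \eta \lambda \left[ \frac{\lambda}{2L^2} \left( \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \|u_i - \alpha_i^*\|^2 \right) + \left( \frac{\lambda (1 - \beta)}{2L^2} - \frac{\eta}{2} \right) \|v_t\|^2 + (w^{(t-1)} - w^*)^\top v_t \right] .
\end{align*}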


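The definition of $\eta$ implies that $\eta \le \lambda (1 - \beta)/L^2$, so the coefficient of $\|v_t\|^2$ is non-negative. By smoothness
of each $\phi_i$ we have $\|u_i - \alpha_i^*\|^2 = \|\nabla \phi_i(w^{(t-1)}) - \nabla \phi_i(w^*)\|^2 \le L^2 \|w^{(t-1)} - w^*\|^2$. Therefore,

\[
C_{t-1} - C_t \ge \eta \lambda \left[ \frac{\lambda}{2L^2} \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \frac{\lambda}{2} \|w^{(t-1)} - w^*\|^2 + (w^{(t-1)} - w^*)^\top v_t \right] .
\]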

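Taking expectation of both sides (w.r.t. the choice of $i$ and conditioned on $w^{(t-1)}$ and $\alpha^{(t-1)}$) and noting
that $\mathbb{E}[v_t] = \nabla P(w^{(t-1)})$, we obtain that

\[
\mathbb{E}[C_{t-1} - C_t] \ge \eta \lambda \left[ \frac{\lambda}{2L^2} \mathbb{E} \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \frac{\lambda}{2} \|w^{(t-1)} - w^*\|^2 + (w^{(t-1)} - w^*)^\top \nabla P(w^{(t-1)}) \right] .
\]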

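Using the strong convexity of $P$ we have $(w^{(t-1)} - w^*)^\top \nabla P(w^{(t-1)}) \ge P(w^{(t-1)}) - P(w^*) + \frac{\lambda}{2} \|w^{(t-1)} - w^*\|^2$
and $P(w^{(t-1)}) - P(w^*) \ge \frac{\lambda}{2} \|w^{(t-1)} - w^*\|^2$, which together yield $(w^{(t-1)} - w^*)^\top \nabla P(w^{(t-1)}) \ge
\lambda \|w^{(t-1)} - w^*\|^2$. Therefore,

\[
\mathbb{E}[C_{t-1} - C_t] \ge \eta \lambda \left[ \frac{\lambda}{2L^2} \mathbb{E} \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 + \left( -\frac{\lambda}{2} + \lambda \right) \|w^{(t-1)} - w^*\|^2 \right] = \eta \lambda\, C_{t-1} .
\]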

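It follows that

\[
\mathbb{E}[C_t] \le (1 - \eta \lambda)\, C_{t-1}
\]

and repeating this recursively we end up with

\[
\mathbb{E}[C_t] \le (1 - \eta \lambda)^t\, C_0 \le e^{-\eta \lambda t}\, C_0 ,
\]

which concludes the proof of the first part of Theorem 2. The second part follows by observing that $P$ is
$(L + \lambda)$-smooth, which gives $P(w) - P(w^*) \le \frac{L + \lambda}{2} \|w - w^*\|^2$.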


4.2 Proof of Theorem 1

In the proof of Theorem 2 we bounded the term $\|u_i - \alpha_i^*\|^2$ by $L^2 \|w^{(t-1)} - w^*\|^2$ based on the smoothness
of $\phi_i$. We now assume that $\phi_i$ is also convex, which enables us to bound $\|u_i - \alpha_i^*\|^2$ based on the current
sub-optimality.

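Lemma 1. Assume that each $\phi_i$ is $L$-smooth and convex. Then, for every $w$,

\[
\frac{1}{n} \sum_{i=1}^{n} \|\nabla \phi_i(w) - \nabla \phi_i(w^*)\|^2 \le 2L \left( P(w) - P(w^*) - \frac{\lambda}{2} \|w - w^*\|^2 \right) .
\]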

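Proof. For every $i$, define

\[
g_i(w) = \phi_i(w) - \phi_i(w^*) - \nabla \phi_i(w^*)^\top (w - w^*) .
\]

Clearly, since $\phi_i$ is $L$-smooth so is $g_i$. In addition, by convexity of $\phi_i$ we have $g_i(w) \ge 0$ for all $w$. It follows
that $g_i$ is non-negative and smooth, and therefore, it is self-bounded (see Section 12.1.3 in [7]):

\[
\|\nabla g_i(w)\|^2 \le 2L\, g_i(w) .
\]

Using the definition of $g_i$, we obtain

\[
\|\nabla \phi_i(w) - \nabla \phi_i(w^*)\|^2 = \|\nabla g_i(w)\|^2 \le 2L\, g_i(w) = 2L \left[ \phi_i(w) - \phi_i(w^*) - \nabla \phi_i(w^*)^\top (w - w^*) \right] .
\]

Taking expectation over $i$ and observing that $P(w) = \mathbb{E}\, \phi_i(w) + \frac{\lambda}{2} \|w\|^2$ and $0 = \nabla P(w^*) = \mathbb{E}\, \nabla \phi_i(w^*) +
\lambda w^*$ we obtain

\begin{align*}
\mathbb{E} \|\nabla \phi_i(w) - \nabla \phi_i(w^*)\|^2 &\le 2L \left[ P(w) - \frac{\lambda}{2} \|w\|^2 - P(w^*) + \frac{\lambda}{2} \|w^*\|^2 + \lambda\, {w^*}^\top (w - w^*) \right] \\
&= 2L \left[ P(w) - P(w^*) - \frac{\lambda}{2} \|w - w^*\|^2 \right] .
\end{align*}

□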
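Lemma 1 can be sanity-checked numerically on a small convex instance. The sketch below uses a hypothetical one-dimensional problem with squared losses, for which $L = \max_i x_i^2$ and $w^*$ has a closed form; the data and the test points are arbitrary choices made here.

```python
# Hypothetical 1-D toy data with phi_i(w) = 0.5 * (x_i * w - y_i)^2.
xs = [1.0, 2.0, -1.0, 0.5]
ys = [0.5, 1.0, 2.0, 1.0]
lam = 0.3
n = len(xs)
L = max(x * x for x in xs)        # smoothness constant of each phi_i

def phi(i, w):
    return 0.5 * (xs[i] * w - ys[i]) ** 2

def grad_phi(i, w):
    return (xs[i] * w - ys[i]) * xs[i]

def P(w):
    return sum(phi(i, w) for i in range(n)) / n + 0.5 * lam * w * w

# Closed-form minimizer of P for this quadratic problem.
w_star = (sum(x * y for x, y in zip(xs, ys)) / n) / (sum(x * x for x in xs) / n + lam)

def lemma_sides(w):
    """Return (left-hand side, right-hand side) of the inequality in Lemma 1."""
    lhs = sum((grad_phi(i, w) - grad_phi(i, w_star)) ** 2 for i in range(n)) / n
    rhs = 2 * L * (P(w) - P(w_star) - 0.5 * lam * (w - w_star) ** 2)
    return lhs, rhs

checks = [lemma_sides(w) for w in (-3.0, -0.5, 0.0, 1.0, 4.0)]
```

For every test point the first entry of the pair is no larger than the second, as the lemma predicts.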
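We now consider the potential

\[
D_t = \frac{1}{2L} A_t + \frac{\lambda}{2} B_t .
\]

Combining (1) and (2) we obtain

\begin{align*}
D_{t-1} - D_t &= \eta \lambda \left[ \frac{1}{2L} \left( \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \|u_i - \alpha_i^*\|^2 + (1 - \beta) \|v_t\|^2 \right) + (w^{(t-1)} - w^*)^\top v_t - \frac{\eta}{2} \|v_t\|^2 \right] \\
&= \eta \lambda \left[ \frac{1}{2L} \left( \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \|u_i - \alpha_i^*\|^2 \right) + \left( \frac{1 - \beta}{2L} - \frac{\eta}{2} \right) \|v_t\|^2 + (w^{(t-1)} - w^*)^\top v_t \right] \\
&\ge \eta \lambda \left[ \frac{1}{2L} \left( \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \|u_i - \alpha_i^*\|^2 \right) + (w^{(t-1)} - w^*)^\top v_t \right] ,
\end{align*}

where in the last inequality we used the assumption $\eta \le \frac{1 - \beta}{L}$.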


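Take expectation of the above w.r.t. the choice of $i$, using Lemma 1, using $\mathbb{E}[v_t] = \nabla P(w^{(t-1)})$, and using
convexity of $P$ that yields $P(w^*) - P(w^{(t-1)}) \ge (w^* - w^{(t-1)})^\top \nabla P(w^{(t-1)})$, we obtain

\begin{align*}
\mathbb{E}[D_{t-1} - D_t] &\ge \eta \lambda \left[ \frac{1}{2L} \left( \mathbb{E} \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \mathbb{E} \|u_i - \alpha_i^*\|^2 \right) + (w^{(t-1)} - w^*)^\top \mathbb{E}\, v_t \right] \\
&\ge \eta \lambda \left[ \frac{1}{2L} \mathbb{E} \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \left( P(w^{(t-1)}) - P(w^*) - \frac{\lambda}{2} \|w^{(t-1)} - w^*\|^2 \right) + (w^{(t-1)} - w^*)^\top \nabla P(w^{(t-1)}) \right] \\
&\ge \eta \lambda \left[ \frac{1}{2L} \mathbb{E} \|\alpha_i^{(t-1)} - \alpha_i^*\|^2 + \frac{\lambda}{2} \|w^{(t-1)} - w^*\|^2 \right] \\
&= \eta \lambda\, D_{t-1} .
\end{align*}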


This gives $\mathbb{E}[D_t] \le (1 - \eta \lambda)\, D_{t-1} \le e^{-\eta \lambda} D_{t-1}$, which concludes the proof of the first part of the theorem.
The second part follows by observing that $P$ is $(L + \lambda)$-smooth, which gives $P(w) - P(w^*) \le \frac{L + \lambda}{2} \|w - w^*\|^2$.